Darwin Harbour sediment monitoring program analysis application manual

Author

Murray Logan

Published

03/06/2024

1 About

This document comprises the manual for the Darwin Harbour sediment monitoring program analysis application. It provides information on:

  • a broad overview of the structure of the application
  • the application dependencies and how to install them
  • starting the application
  • progressing through the analysis pipeline
  • visualising, interpreting and extracting outputs

2 Structural overview

The R Graphical and Statistical Environment offers an ideal platform for developing and running complex statistical analyses, as well as presenting the outcomes via professional graphical/tabular representations. As a completely scripted language, it also offers the potential for both full transparency and reproducibility. Nevertheless, as the language, and more specifically its extension packages, are community developed and maintained, the environment evolves over time. Similarly, the underlying operating systems and programs on which R and its extension packages depend (hereafter referred to as the operating environment) also change over time. Consequently, the stability and reproducibility of R code can degrade over time.

2.1 Docker containers

One way to attempt to future-proof a codebase that must be run upon a potentially unpredictable operating environment is to containerise the operating environment, such that it is preserved unchanged over time. Containers (specifically docker containers) are lightweight abstraction units that encapsulate applications and their dependencies within standardized, self-contained execution environments. Leveraging containerization technology, they package application code, runtime, libraries, and system tools into isolated units (containers) that abstract away underlying infrastructure differences, enabling consistent and predictable execution across diverse computing platforms.

Containers offer several advantages, such as efficient resource utilization, rapid deployment, and scalability. They enable developers to build, test, and deploy applications with greater speed and flexibility. Docker containers have become a fundamental building block in modern software development, enabling the development and deployment of applications in a consistent and predictable manner across various environments.

2.2 Shiny applications

Shiny is a web application framework for R that enables the creation of interactive and data-driven web applications directly from R scripts. Developed by RStudio, Shiny simplifies the process of turning analyses into interactive web-based tools without the need for extensive web development expertise.

What makes Shiny particularly valuable is its seamless integration with R, allowing statisticians and data scientists to build and deploy bespoke statistical applications, thereby making data visualization, exploration, and analysis accessible to a broader audience. With its interactive and user-friendly nature, Shiny serves as a powerful tool for sharing insights and engaging stakeholders in a more intuitive and visual manner.

2.3 Git and GitHub

Git, a distributed version control system, and GitHub, a web-based platform for hosting and collaborating on Git repositories, play pivotal roles in enhancing reproducibility and transparency in software development. By tracking changes in source code and providing a centralized platform for collaborative work, Git and GitHub enable developers to maintain a detailed history of code alterations. This history serves as a valuable asset for ensuring the reproducibility of software projects, allowing users to trace and replicate specific versions of the codebase.

GitHub Actions (an integrated workflow automation feature of GitHub), automates tasks such as building, testing, and deploying applications and artifacts. Notably, through workflow actions, GitHub Actions can build docker containers and act as a container registry. This integration enhances the overall transparency of software development workflows, making it easier to share, understand, and reproduce projects collaboratively.

Figure 1 provides a schematic overview of the relationship between the code produced by the developer, the GitHub cloud repository and container registry, and the Shiny docker container run by the user.

Figure 1: Diagram illustrating the relationship between the code produced by the developer and the Shiny docker container utilised by the user, with the GitHub cloud as conduit. The developed codebase includes a Shiny R application with an R backend, a Dockerfile (instructions used to assemble a full operating environment) and a GitHub workflow file (instructions for building and packaging the docker image on GitHub via Actions).

3 Installation

3.1 Installing docker desktop

Retrieving and running docker containers requires the installation of Docker Desktop on Windows and macOS.

3.1.1 Windows

The steps for installing Docker Desktop are:

  • Download the Installer: head to https://docs.docker.com/desktop/install/windows-install/ and follow the instructions for downloading the appropriate installer for your Windows version (Home or Pro).

  • Run the Installer: double-click the downloaded file and follow the on-screen instructions from the installation wizard. Accept the license agreement and choose your preferred installation location.

  • Configure Resources (Optional): Docker Desktop might suggest allocating some system resources like CPU and memory. These settings can be adjusted later, so feel free to use the defaults for now.

  • Start the Docker Engine: once installed, click the “Start Docker Desktop” button. You may see a notification in the taskbar - click it to confirm and allow Docker to run in the background.

  • Verification: open a terminal (or Powershell) and run docker --version. If all went well, you should see information about the installed Docker Engine version.

Additional Tips:

  • Ensure Hyper-V (virtualization) is enabled in your BIOS settings for optimal performance.

3.2 Installing and running the app

The task of installing and running the app is performed via a single deploy script (deploy.bat on Windows or deploy.sh on Linux/macOS/WSL). For this to work properly, the deploy script should be placed in a folder alongside a folder (called input) that contains the input datasets (in excel format). This structure is illustrated below for Windows.

\
|- deploy.bat
|- input
   |- dataset1.xlsx
   |- dataset2.xlsx
Note

In the above illustration, there are two example datasets (dataset1.xlsx and dataset2.xlsx). The datasets need NOT be called dataset1.xlsx. They can have any name you choose, so long as they are excel files that adhere to the structure outlined in Section 4.1.
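A quick check of this expected layout can be sketched as follows. This is purely illustrative (the app itself performs its own checks via the deploy script and R backend); the function name is hypothetical, but the required input folder name and *.xlsx pattern are as described above.

```python
from pathlib import Path

def find_input_datasets(base_dir):
    """Return the Excel files found in the 'input' folder below base_dir.

    Raises FileNotFoundError if the expected 'input' folder is absent,
    mirroring the folder structure required by the deploy script.
    """
    input_dir = Path(base_dir) / "input"
    if not input_dir.is_dir():
        raise FileNotFoundError(f"expected an 'input' folder inside {base_dir}")
    return sorted(input_dir.glob("*.xlsx"))
```

Any dataset names are acceptable; only the folder name and the .xlsx extension matter.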

4 The Darwin Harbour Sediment Monitoring Program Analysis App

This Shiny application is designed to ingest very specifically structured excel spreadsheets containing Darwin Harbour sediment monitoring data and produce various analyses and visualisations. The application is served from a docker container to the localhost and the default web browser.

Docker containers can be thought of as computers running within other computers. More specifically, a container runs an instance of an image built using a series of specific instructions that govern the entire software environment. As a result, containers run from the same image will operate (virtually) identically regardless of the host environment. Furthermore, since the build instructions can specify exact versions of all software components, containers provide a way of maximising the chances that an application will continue to run as designed into the future, despite changes to operating environments and dependencies.

This Shiny application comprises five pages (each accessible via the sidebar menu on the left side of the screen):

  1. a Landing page (this page) providing access to the settings and overall initial instructions
  2. a Dashboard providing information about the progression of tasks in the analysis pipeline
  3. a Data page providing overviews of data in various stages
  4. an Exploratory Data Analysis page providing graphical data summaries
  5. an Analysis page presenting the results of the temporal analyses

Each page will also contain instructions to help guide you through using or interpreting the information. In some cases, this will take the form of an info box (such as the current box). In other cases, it will take the form of little symbols whose content is revealed with a mouse hover.

There are numerous stages throughout the analysis pipeline that may require user review (for example examining the exploratory data analysis figures to confirm that the data are as expected). Consequently, it is necessary for the user to manually trigger each successive stage of the pipeline. The stages are:

  • Stage 1 - Prepare environment


    This stage is run automatically on startup and essentially sets up the operating environment.

  • Stage 2 - Obtain data


    This stage comprises the following steps:

    • reading in the excel files within the nominated input path
    • validating the input data according to a set of validation rules
    • constructing various spatial objects for mapping and spatial aggregation purposes

    The tables within the Raw data tab of the Data page will also be populated.

  • Stage 3 - Process data


    This stage comprises the following steps:

    • apply limit of reporting values (LoRs)
    • pivot the data into a longer format that is more suitable for analysis and graphing
    • join in the metadata to each associated sheet
    • make a unique key
    • collate all the data from across the multiple sheets and files into a single data set
    • incorporate the spatial data
    • tidy the field names
    • apply data standardisations
    • create a site lookup table to facilitate fast incorporation of spatial information into any outputs.

    The tables within the Processed data tab of the Data page will also be populated.

  • Stage 4 - Exploratory data analysis


    This stage comprises the following steps:

    • retrieve the processed data.
    • construct spatio-temporal design plots conditioned on initial sampling semester
    • construct variable temporal design plots conditioned on harbour zone
    • construct site level temporal trends for each variable
    • construct zone level temporal and spatial visualisations for each variable

    The exploratory data figures of the Exploratory Data Analysis page will also be populated.

  • Stage 5 - Temporal analyses


    This stage comprises the following steps:

    • retrieve the processed data
    • prepare the data for modelling
    • prepare appropriate model formulae for each zone, variable, standardisation type
    • prepare appropriate model priors for each zone, variable, standardisation type
    • prepare appropriate model template
    • fit the models for each zone, variable, standardisation type
    • perform model validations for each zone, variable, standardisation type
    • estimate all the contrasts for each model and collate all the effects

Underneath the sidebar menu there are a series of buttons that control progression through the analysis pipeline stages. When a button is blue (and has a play icon), it indicates that the Stage is the next Stage to be run in the pipeline. Once a stage has run, the button will turn green. Grey buttons are disabled.

Clicking on a button will run that stage. Once a stage has run, the button will change to either green (success), yellow (warnings) or red (failure), indicating whether errors/warnings were encountered. If the stage completed successfully, the button corresponding to the next available stage will be activated.

Sidebar menu items in orange font are active, and clicking an active menu item will reveal the associated page. Inactive menu items are in grey font. Menu items only become active once the appropriate run stage has completed. The following table lists the events that activate each menu item.

Menu Item                   Trigger Event
Landing                     Always active
Dashboard                   Always active
Data                        After Stage 2
Exploratory Data Analysis   After Stage 4
Analysis                    After Stage 5
Manual                      Always active
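The activation rules in the table above can be represented as a simple lookup from completed stage to active menu items. This is a sketch only (the real app manages activation reactively in Shiny); stage 0 here is shorthand for "always active":

```python
# Stage after which each menu item becomes active (0 = always active),
# per the trigger-event table above.
MENU_TRIGGERS = {
    "Landing": 0,
    "Dashboard": 0,
    "Data": 2,
    "Exploratory Data Analysis": 4,
    "Analysis": 5,
    "Manual": 0,
}

def active_menu_items(last_completed_stage):
    """Return the menu items active once the given stage has completed."""
    return [item for item, stage in MENU_TRIGGERS.items()
            if last_completed_stage >= stage]
```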

4.1 Data requirements

To be valid, input data must be excel files (*.xlsx) comprising at least the following sheets (each of which must at least have the fields listed in their respective tables):

  • metals

    Field | Description | Validation conditions
    Sample_ID | unique sample ID | must contain characters
    *¹ (mg/kg) | observed concentration of metal in sediment sample | must contain only numbers or start with a '<' symbol

    1: where the '*' represents a one or two character chemical symbol (such as 'Ag' or 'V'). There will typically be numerous such fields

  • hydrocarbons

    Field | Description | Validation conditions
    Sample_ID | unique sample ID | must contain characters
    >C*¹ | observed concentration of hydrocarbons within a specific size bin in sediment sample | must contain only numbers or start with a '<' symbol

    1: where the '*' represents a string of characters defining the size bin (such as '10 _C16'). There will typically be numerous such fields
  • total_carbons

    Field | Description | Validation conditions
    Sample_ID | unique sample ID | must contain characters
    TOC (%) | observed total organic carbon (as a percentage of the sample weight) | must contain only numbers
  • metadata

    Field | Description | Validation conditions
    IBSM_site | name of the site from the perspective of IBSM | must contain characters (or be blank)
    Site_ID | a unique site ID | must contain characters (cannot be blank)
    Sample_ID | unique sample ID (the key to data sheets) | must contain characters (cannot be blank)
    Original_SampleID | unique sample ID | must contain characters
    Latitude | site latitude | must be numeric (and negative)
    Longitude | site longitude | must be numeric
    Acquire_date_time | date and time sample was collected (D/M/YYYY hh:mm:ss) | must be in datetime format
    Sampler | name of person responsible for collecting sample (ignored) | ignored
    Notes | project description (ignored) | ignored
    Baseline_site | the unique site ID of the corresponding baseline sample | must contain characters (cannot be blank)
    Baseline_acquire_date_site | the date and time of the corresponding baseline sample | must be in datetime format
  • notes - this sheet is not processed or validated
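The per-field validation conditions above can be illustrated with a couple of simple checks. This is a sketch only (the app applies its own validation rules in R; the function names are hypothetical), but it captures the stated rules: concentrations must be numbers or start with a '<' symbol, and sample IDs must be non-blank character strings.

```python
import re

def valid_concentration(value):
    """A concentration must be a number, optionally prefixed with '<'
    (a value below the limit of reporting)."""
    return re.fullmatch(r"<?\s*\d+(\.\d+)?", str(value).strip()) is not None

def valid_sample_id(value):
    """A Sample_ID must be a non-blank string of characters."""
    return isinstance(value, str) and value.strip() != ""
```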

4.2 Landing page

To run this tool, please adhere to the following steps:

  1. review the Path Settings (specifically checking the “Data input dir” and ensuring that there is at least one data file listed in the box under this setting)
  2. review the Run Settings. In particular,
    • consider whether you need to Clear the previous data - clicking the button to do so. Clearing the previous data deletes all cached results and ensures that the analyses are performed afresh. This is recommended whenever the input data change. Not clearing the previous data allows the user to skip directly to later run stages if earlier stages have already been run.
    • consider the Limit of Reporting (LoR) setting.
      • the default is to set the value equal to the specified limit of reporting for a value (such values must start with a “<”) and will be flagged as “left” censored. Models that accommodate censored data take a probabilistic approach to inferring the likely distribution of all observations including those beyond the limit of reporting and are considered more appropriate.
      • the alternative is the more traditional approach of replacing the value with 1/2 of the limit of reporting value and using this in the analyses. Whilst traditional, this approach tends to turn the resulting values into outliers, which is problematic for analyses.
  3. navigate the Dashboard via the menu on the left sidebar
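The difference between the two LoR settings above can be illustrated numerically. This is a sketch with a made-up value and a hypothetical function name (the actual handling is done in the app's R backend):

```python
def handle_lor(raw, method="censored"):
    """Handle a value reported below the limit of reporting (e.g. '<0.5').

    'censored' keeps the LoR as the value and flags it as left-censored,
    for models that accommodate censored data; 'halve' substitutes half
    the LoR, the more traditional approach.
    """
    if isinstance(raw, str) and raw.startswith("<"):
        lor = float(raw[1:])
        if method == "censored":
            return lor, "left"       # value = LoR, flagged left-censored
        return lor / 2.0, "none"     # value = LoR/2, treated as observed
    return float(raw), "none"        # an ordinary observed value
```

So '<0.5' becomes (0.5, flagged "left") under the default setting, or (0.25, unflagged) under the traditional half-LoR substitution.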

4.3 Dashboard

The analysis pipeline comprises numerous Stages, each of which is made up of several more specific Tasks. The individual Tasks represent an action performed in furtherance of the analysis and for which there are reportable diagnostics. For example, once the application loads, the first Stage of the pipeline is to prepare the environment. The first Task in this Stage is to load the necessary R packages used by the codebase. Whilst technically this action consists of numerous R calls (one for each package that needs to be loaded), the block of actions is evaluated as a set.

Initially, all tasks are reported as “pending”. As the pipeline progresses, each Task is evaluated and its status is returned as either “success” or “failure”.
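The Stage/Task status model described above can be sketched as follows. This is a hypothetical structure for illustration only; the app's real implementation is reactive Shiny code:

```python
def run_stage(tasks):
    """Run a list of (name, callable) tasks, returning a status per task.

    Every task starts as 'pending'; after evaluation its status becomes
    'success' or 'failure', as on the Dashboard status tree.
    """
    statuses = {name: "pending" for name, _ in tasks}
    for name, action in tasks:
        try:
            action()
            statuses[name] = "success"
        except Exception:
            statuses[name] = "failure"
    return statuses
```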

The Stage that is currently (or most recently) being run will be expanded, whereas all other Stages will be collapsed (unless they contain errors). It is also possible to expand/collapse a Stage by double clicking on its title (or the small arrow symbol at the left side of the tree).

As the pipeline progresses, Task logs are written to a log file and echoed to the Logs panel. Each row represents the returned status of a specific Task and is formatted as:

  • the time/date that the Task was evaluated
  • the Task status, which can be one of:
    • SUCCESS the task succeeded
    • FAILURE the task failed and should be investigated
    • WARNING the task generated a warning - typically these can be ignored, as they are usually passed on from underlying routines and are targeted more at developers than users.
  • the Stage followed by the Task name
  • in the case of errors and warnings, there will also be the error or warning message passed on from the underlying routines. These can be useful for helping to diagnose the source and cause of issues.
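A log line of the shape described above could be parsed as follows. The exact field layout (pipe-delimited, in this order) is an assumption made for illustration, not the app's actual log format:

```python
def parse_log_line(line):
    """Split an assumed log line of the form
    '2024-06-03 10:15:02 | SUCCESS | Stage 2 - Obtain data | read excel files'
    into its components: time, status, stage/task name, and optional message."""
    parts = [p.strip() for p in line.split("|")]
    timestamp, status, stage_task = parts[0], parts[1], parts[2]
    message = parts[3] if len(parts) > 3 else ""
    return {"time": timestamp, "status": status,
            "stage_task": stage_task, "message": message}
```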

The Logs in the Log panel are presented in chronological order and will autoscroll so that the most recent log is at the bottom of the display. If the number of Log lines exceeds 10, a scroll bar will appear on the right side of the panel to help review earlier Logs.

Note

The Status and Logs are completely refreshed each time the application is restarted.

The Progress panel also has a tab (called Terminal-like) that provides an alternative representation of the status and progress of the pipeline.

4.4 Data